Introduction to the Special Issue on Computational Linguistics Using Large Corpora

Authors

Kenneth Ward Church

Robert L. Mercer

Abstract

The 1990s have witnessed a resurgence of interest in 1950s-style empirical and statistical methods of language analysis. Empiricism was at its peak in the 1950s, dominating a broad set of fields ranging from psychology (behaviorism) to electrical engineering (information theory). At that time, it was common practice in linguistics to classify words not only on the basis of their meanings but also on the basis of their cooccurrence with other words. Firth, a leading figure in British linguistics during the 1950s, summarized the approach with the memorable line: "You shall know a word by the company it keeps" (Firth 1957). Regrettably, interest in empiricism faded in the late 1950s and early 1960s with a number of significant events including Chomsky's criticism of n-grams in Syntactic Structures (Chomsky 1957) and Minsky and Papert's criticism of neural networks in Perceptrons (Minsky and Papert 1969).

Perhaps the most immediate reason for this empirical renaissance is the availability of massive quantities of data: more text is available than ever before. Just ten years ago, the one-million word Brown Corpus (Francis and Kučera 1982) was considered large, but even then, there were much larger corpora such as the Birmingham Corpus (Sinclair et al. 1987; Sinclair 1987). Today, many locations have samples of text running into the hundreds of millions or even billions of words. Collections of this magnitude are becoming widely available, thanks to data collection efforts such as the Association for Computational Linguistics' Data Collection Initiative (ACL/DCI), the European Corpus Initiative (ECI), ICAME, the British National Corpus (BNC), the Linguistic Data Consortium (LDC), the Consortium for Lexical Research (CLR), Electronic Dictionary Research (EDR), and standardization efforts such as the Text Encoding Initiative (TEI).

The data-intensive approach to language, which is becoming known as Text Analysis, takes a pragmatic approach that is well suited to meet the recent emphasis on numerical evaluations and concrete deliverables. Text Analysis focuses on broad (though possibly superficial) coverage of unrestricted text, rather than deep analysis of (artificially) restricted domains.

Similar resources

Translation and contrastive linguistic studies at the interface of English and Chinese: Significance and implications

Corpora have revolutionized nearly all areas of linguistic research over the past four decades (McEnery, Xiao and Tono 2006; McEnery and Hardie 2012). Translation studies and contrastive linguistics are no exceptions. Indeed, the rapid development of bilingual parallel corpora as well as monolingual and multilingual comparable corpora since the early 1990s has been of particular relevance and c...

Dictionary of Abstract and Concrete Words of the Russian Language: A Methodology for Creation and Application

The paper describes the first stage of a project on creating an electronic dictionary with numerical estimates of the degree of abstractness and concreteness of Russian words. Our approach is to integrate data obtained from several different sources: text corpora, psycholinguistic experiments, published dictionaries, markers of abstractness (certain suffixes) and a translation of a similar dict...

Introduction to the special issue: On wordnets and relations

Since its inception a quarter century ago, Princeton WordNet [PWN] (Miller 1995; Fellbaum 1998) has had a profound influence on research and applications in lexical semantics, computational linguistics and natural language processing. The numerous uses of this lexical resource have motivated the building of wordnets in several dozen languages, including even a "dead" language, Latin. This spe...

Special Issue Introduction: Semantic Role Labeling: An Introduction to the Special Issue

Semantic role labeling, the computational identification and labeling of arguments in text, has become a leading task in computational linguistics today. Although the issues for this task have been studied for decades, the availability of large resources and the development of statistical machine learning methods have heightened the amount of effort in this field. This special issue presents se...

Introduction to the Special Issue on Finite State Methods in NLP

Journal title:

Computational Linguistics

Volume 19

Publication date: 1993
